This document has the purpose to get familiar with the stock market data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from pandas import DataFrame
key = '88D9DSE169BZ1ZID'
from alpha_vantage.techindicators import TechIndicators
from alpha_vantage.timeseries import TimeSeries
We will be using the Alpha Vantage API for multiple reasons
- real time data that will be updated everytime we work with the data
- accurate data for a wide variety of different stocks
- have built-in functions for different types of analysis
ts = TimeSeries(key, output_format = 'pandas')
prices, meta = ts.get_daily_adjusted('FB', outputsize='full')
prices.columns = ['Open', 'High', 'Low', 'Close', 'Adjusted Close', 'Volume', 'Divident Amount', 'Split Coefficient']
prices.sort_index(inplace=True)
prices.head(10)
| Open | High | Low | Close | Adjusted Close | Volume | Divident Amount | Split Coefficient | |
|---|---|---|---|---|---|---|---|---|
| date | ||||||||
| 2012-05-18 | 42.050 | 45.00 | 38.00 | 38.2318 | 38.2318 | 573576400.0 | 0.0 | 1.0 |
| 2012-05-21 | 36.530 | 36.66 | 33.00 | 34.0300 | 34.0300 | 168192700.0 | 0.0 | 1.0 |
| 2012-05-22 | 32.610 | 33.59 | 30.94 | 31.0000 | 31.0000 | 101786600.0 | 0.0 | 1.0 |
| 2012-05-23 | 31.370 | 32.50 | 31.36 | 32.0000 | 32.0000 | 73600000.0 | 0.0 | 1.0 |
| 2012-05-24 | 32.950 | 33.21 | 31.77 | 33.0300 | 33.0300 | 50237200.0 | 0.0 | 1.0 |
| 2012-05-25 | 32.900 | 32.95 | 31.11 | 31.9100 | 31.9100 | 37149800.0 | 0.0 | 1.0 |
| 2012-05-29 | 31.480 | 31.69 | 28.65 | 28.8400 | 28.8400 | 78063400.0 | 0.0 | 1.0 |
| 2012-05-30 | 28.695 | 29.55 | 27.86 | 28.1900 | 28.1900 | 57267900.0 | 0.0 | 1.0 |
| 2012-05-31 | 28.545 | 29.67 | 26.83 | 29.6000 | 29.6000 | 111639200.0 | 0.0 | 1.0 |
| 2012-06-01 | 28.892 | 29.15 | 27.39 | 27.7200 | 27.7200 | 41855500.0 | 0.0 | 1.0 |
prices.tail(10)
| Open | High | Low | Close | Adjusted Close | Volume | Divident Amount | Split Coefficient | |
|---|---|---|---|---|---|---|---|---|
| date | ||||||||
| 2021-03-26 | 278.3018 | 284.500 | 277.77 | 283.02 | 283.02 | 17629185.0 | 0.0 | 1.0 |
| 2021-03-29 | 285.7700 | 293.180 | 284.70 | 290.82 | 290.82 | 21718831.0 | 0.0 | 1.0 |
| 2021-03-30 | 289.8300 | 292.470 | 286.70 | 288.00 | 288.00 | 17474537.0 | 0.0 | 1.0 |
| 2021-03-31 | 289.9900 | 296.500 | 288.61 | 294.53 | 294.53 | 19498157.0 | 0.0 | 1.0 |
| 2021-04-01 | 298.4000 | 302.400 | 296.60 | 298.66 | 298.66 | 17615980.0 | 0.0 | 1.0 |
| 2021-04-05 | 300.8900 | 310.765 | 300.68 | 308.91 | 308.91 | 28237006.0 | 0.0 | 1.0 |
| 2021-04-06 | 308.8365 | 311.350 | 305.25 | 306.26 | 306.26 | 17335246.0 | 0.0 | 1.0 |
| 2021-04-07 | 306.3400 | 314.250 | 305.50 | 313.09 | 313.09 | 22855241.0 | 0.0 | 1.0 |
| 2021-04-08 | 314.8500 | 315.880 | 310.05 | 313.02 | 313.02 | 20894148.0 | 0.0 | 1.0 |
| 2021-04-09 | 311.4000 | 314.740 | 310.33 | 312.46 | 312.46 | 15988568.0 | 0.0 | 1.0 |
So this is the data that we have by the API. To start, let's first get to know what each column means.
Open is the value of the given stock when the stock market opened on a given day.
High is the highest value that the given stock has reached during the course of the given day.
Low is the opposite of high - lowest value of the stock on this day.
Close is the value of the given stock when the stock market closes on the given day.
Adjusted Close is the actual close value since 'Close' in this data is a raw data.
Volume indicates the amount of shares that were bought or sold on the given day.
Divident Amount refers to a reward, cash or otherwise, that a company gives to its shareholders.
Split coefficient is when a company divides the existing shares of its stock into multiple new shares to boost the stock's liquidity.
For this project I will use the Facebook Stock because:
- one of the most famous stocks on the market
- has a long history of records (since 2012)
Let's begin by checking for any missing data
# Find columns with missing values and their percent missing
prices.isnull().sum()
miss_val = prices.isnull().sum().sort_values(ascending=False)
miss_val = pd.DataFrame(data=prices.isnull().sum().sort_values(ascending=False), columns=['MissvalCount'])
# Add a new column to the dataframe and fill it with the percentage of missing values
miss_val['Percent'] = miss_val.MissvalCount.apply(lambda x : '{:.2f}'.format(float(x)/prices.shape[0] * 100))
miss_val = miss_val[miss_val.MissvalCount > 0].style.background_gradient(cmap='Reds')
miss_val
| MissvalCount | Percent |
|---|
Since there is no single row in in this data we can assume that there are not any missing values in our dataframe.
plt.figure(figsize = (15,7))
plt.plot(prices['Close'])
plt.title('Closing price')
plt.show()
Above the close value of the Facebook stock could be seen. However, I have tested the API for several other stocks and for some of them the Close values are sometimes incorrect. Initially the data that I used only had this closing values but after reading some documentation I found that this 'Close' data is raw and if I need the actual closing value I will need different call to the API to get different dataset.
#from matplotlib.pyplot import figure
plt.figure(figsize = (16,7))
plt.plot(prices['Adjusted Close'])
plt.title('Real closing price')
plt.show()
Now we can see the accurate closing prices for Facebook since the stock became public in mid 2012. On first look there appears to be no difference but there are in fact little differences. Let's display it.
plt.figure(figsize = (16,7))
plt.plot(prices['Close'])
plt.plot(prices['Adjusted Close'])
plt.title('Difference between raw closing value and adjusted closing value')
plt.legend(['Close', 'Adjusted Close'])
plt.show()
The lines overlap which means that the Close value in this data is pretty accurate even though it is raw. This is a very good sign, because this means that the other values like High, Low and Open will also be accurate.
Let's use some functions to extract the moving averages of the Facebook stock from the API
ti = TechIndicators(key, output_format='pandas')
fb_moving_average_monthly, meta = ti.get_sma('FB', interval='monthly', time_period = 2, series_type='close')
fb_moving_average_weekly, meta = ti.get_sma('FB', interval='weekly', time_period = 2, series_type='close')
print(fb_moving_average_monthly, fb_moving_average_weekly)
SMA date 2012-07-31 26.4025 2012-08-31 19.8840 2012-09-28 19.8590 2012-10-31 21.3850 2012-11-30 24.5550 ... ... 2020-12-31 275.0650 2021-01-29 265.7450 2021-02-26 257.9750 2021-03-31 276.0750 2021-04-09 303.4950 [106 rows x 1 columns] SMA date 2012-06-01 29.8150 2012-06-08 27.4100 2012-06-15 28.5570 2012-06-22 31.5320 2012-06-29 32.0725 ... ... 2021-03-12 266.3400 2021-03-19 279.2550 2021-03-26 286.5650 2021-04-01 290.8400 2021-04-09 305.5600 [463 rows x 1 columns]
plt.figure(figsize = (16,7))
plt.plot(fb_moving_average_monthly)
plt.title('Monthly simple moving average(SMA) of Facebook')
plt.show()
This Simple Moving Average(SMA) represents the mean of the data set for a given period. In that case the SMA's period is one month. SMAs are part of the technical analysis in predicting the future value of a stock. Let's plot it on the same chart as the price movement.
plt.figure(figsize = (16,7))
plt.plot(fb_moving_average_monthly)
plt.plot(prices['Adjusted Close'])
plt.legend(['Monthly moving average', 'Close'])
plt.title('Comparison of SMA and daily closing prices')
plt.show()
Such graphs can show very interesting data. For example, a change in direction of trend can be indicated by the penetration/crossover of the SMA. Generally a buy signal is generated when a price breaks above the moving average and sell signal is generated by a price break below the moving average. It is added confirmation when the moving average line turns in the direction of the price trend. When there is a high increase or decrease in a short period of time the SMA does not catch up immediately. That is because it uses average monthly values. Owing to this fact the Monthly SMA is good choice for long term predictions but less good for short term ones.
plt.figure(figsize = (16,7))
plt.plot(fb_moving_average_monthly)
plt.plot(fb_moving_average_weekly)
plt.title('Difference between weekly and daily SMA of Facebook')
plt.legend(['montly SMA', 'weekly SMA'])
plt.show()
The weekly SMA looks more like the daily stock prices of Facebook. For this reason it can be better to use it for possible predictions as it is more sensitive to change in trends.
import plotly.graph_objects as go
fig = go.Figure(data=go.Ohlc(x=prices.index.date,
open=prices['Open'],
high=prices['High'],
low=prices['Low'],
close=prices['Close']))
fig.show()
This OHLC chart(for open, high, low and close) is a style of financial chart describing the open, high, low and close values for a given date. The tip of the lines represent the low and high values and the horizontal lines repsent the open and close values. When there is an increase in the stock for a given day the lines are green and when there is a decrease - red. The bar below the graph can be used to filter the dates and observe a specific time period if needed.
Heatmaps are very useful to find relations between two variables in a dataset and this way the user gets a visualisation of the numeric data. No correlations are extremely high. Each square shows the correlation between the variables on each axis. Let's visualize a heatmap on all features in the data set.
df_temp = prices.copy()
Index= ['Open', 'High', 'Low', 'Close', 'Volume', 'Divident Amount', 'Split Coefficient']
Cols = ['Open', 'High', 'Low', 'Close', 'Volume', 'Divident Amount', 'Split Coefficient']
df_temp = DataFrame(abs(np.random.randn(7, 7)), index=Index, columns=Cols)
plt.figure(figsize=(16, 7))
sns.set_context('paper', font_scale=1.3)
sns.heatmap(df_temp.corr(), annot=True, cmap = 'magma')
plt.show()
The closer to 1 the correlation is the more positively correlated they are, that is as one increases so does the other and the closer to 1 the stronger this relationship is. It is noticeable that the correlation between Open and Close is high, which mean that they have a quite strong correlation.
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(prices.corr()[['Close']].sort_values(by='Close', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with close value', fontdict={'fontsize':18}, pad=16);
Correlation ranges from -1 to +1. Values closer to zero means there is no linear trend between the two variables. The closer to 1 the correlation is the more positively correlated they are, that is as one increases so does the other and the closer to 1 the stronger this relationship is. In that case most of the values correlate fully with each other because the price of stock usually moves along the other features like low, high and open.
Both the heatmap and the correlation diagram show that there is no big relation between the volume of a stock and it's price. However, my research shows that Volume is a big part of the Technical analysis when a prediction is made for a specific stock. This is why we are going to display the volume of the Facebook stock now.
plt.figure(figsize = (16,7))
plt.plot(prices['Volume'])
plt.title('Volume of Facebook stock in 100 mil')
plt.show()
Even though in the heatmap and the correlation diagram the volume does not seem so important to the closing value it does have a huge impact on the price because this value represents the amount of time the stock has been bought/sold during the day. Let's try to display both values.
plt.figure(figsize = (16,7))
plt.plot(prices['Volume']/1000000)
plt.plot(prices['Close'])
plt.title('Relation of price to volume in 1 mil')
plt.show()
As expected, on the graph we can see that when there is a drastic change in the volume there is big impact on the price which either increases or decreases. That confirms the fact that Volume is an important feature when predicting the price.
plt.figure(figsize = (16,7))
plt.plot(prices['Divident Amount'])
plt.title('Dividents given by Facebook')
plt.show()
It appears that Facebook that never given dividents to its shareholders. In that case it doesn't make sense to include it in the model, but since other companies do give dividents it would not be wise to drop such column so fast.
#prices['Split Coefficient'].value_counts()
plt.figure(figsize = (16,7))
plt.plot(prices['Split Coefficient'])
plt.title('Split coefficient of Facebook')
plt.show()
Just like the dividents, facebook has not split it's shares. Futher research is needed to see what potential impact does these two values have on the price of a stock.
Post research:
- If a split occurs a stock's price is affected. After a split, the stock price will be reduced (since the number of shares outstanding has increased). For example if we have a 2 for 1 stock split(let's say the company had 10 million shares outstanding before the split, now it will have 20 million) the price of the stock will halve its price.
- Before a dividend is distributed, the issuing company must first declare the dividend amount and the date when it will be paid. The declaration of a dividend naturally encourages investors to purchase stock which leads to increase to the stocks value.
Conclusion
Some very interesting patterns were found in the dataset, including ones that the heatmap and the correlation diagram could not find, but thanks to the domain research, I was able to find them. The next steps include data preparation. Thankfully the API returns very good data and currently I have not experienced problems with missing values in the dataset. Yet possibilities for combining this dataset with other data might be possible to see if I can find any other relations that could help in making the prediction as accurate as possible.